
Overview

In this practical you’ll learn how to work with basic data objects and functions. By the end of this practical you will know how to:

  1. Create vectors of different types using c()
  2. Understand the three main vector classes (numeric, character, and logical) using class()
  3. Calculate basic descriptive statistics using mean(), median(), table() (and more!)
  4. Read and write data in various formats using read_csv() and others
  5. Access and change vectors from data frames using $
  6. Create data.frames and tibbles using data.frame() and tibble()

Tasks

A - Getting set up

  1. Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code.

  2. Open a new R script and save it as a new file called data_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load all packages listed in the Packages section below with library(). Make sure that each of the datasets listed in the Datasets section below is in your 1_Data folder.

B - Creating vectors

The table below shows results from a (fictional) survey of 10 Baselers. In the first part of this practical, you will convert this table to R objects and then analyse them!

id  sex     age  height  weight
1   male    44   174.3   113.4
2   male    65   180.3   75.2
3   female  31   168.3   55.5
4   male    27   209     93.8
5   male    24   176.7
6   male    63   186.6   67.4
7   male    71   151.6   83.3
8   female  41   155.7   67.8
9   male    43   176.1   69.3
10  female  31   166.1   66.3
  1. Create a numeric vector called id that shows the id values. When you finish, print the vector object to see it!
# Create a vector id
XX <- c(XX, XX, ...)

# Print the vector id
XX
  2. Using the class() function, check the class of your id vector. Is it "numeric"?
# Show the class of an object XX
class(XX)
  3. Using the length() function, find out the length of your id vector. Does it have length 10? If not, make sure you defined it correctly!
# Show the length of the id vector
length(XX)
  4. Create a character vector called sex that shows the sex values. Make sure to use quotation marks "" to enclose each element to tell R that the data are of type "character"! When you finish, print the object to see it!
# Create a character vector sex
XX <- c("XX", "XX", "...")
  5. Using the class() function, check the class of your sex vector. Is it "character"?

  6. Using the length() function, find out the length of your sex object. Does it have length 10? If not, make sure you defined it correctly!

  7. Using the same steps as before, create age and height vectors.

  8. Look at the weight data: you’ll notice it contains a missing value. Create a vector called weight containing these data, following the same steps as before, making sure to specify the missing value as NA (no quotation marks).
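
The steps in this section can be sketched on a separate toy example. The data below describe three hypothetical people (not the survey table), so the object names shoe_size, city, and income and all values are assumptions made purely for illustration.

```r
# Hypothetical data for three people (not the survey above)
shoe_size <- c(38, 42.5, 45)            # numeric vector
city      <- c("Basel", "Bern", "Zug")  # character vector: elements in quotes
income    <- c(5200, NA, 6100)          # NA without quotes marks a missing value

class(shoe_size)   # "numeric"
class(city)        # "character"
length(income)     # 3: the NA still counts as an element
is.na(income)      # FALSE TRUE FALSE
```

Note that NA is written without quotation marks. Writing "NA" would create a character string and turn the whole vector into a character vector.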

C - Functions

  1. Using the table() function, find out how many males and females are in the data. You should find 7 males and 3 females!

  2. Using the mean() function, calculate the mean age. It should be 44!

  3. Try calculating the mean value of sex. What happens? Why?

  4. Try calculating the mean weight. You should get an NA value. Why?

  5. Look at the help menu for the mean() function (using ?mean) to look for an argument that will help you with your problem.

  6. Using the correct argument for the mean function, calculate the mean weight ignoring NA values. It should be 76.89!
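
To illustrate how these functions behave, here is a minimal sketch on invented data. The vectors pet and pet_weight are hypothetical and unrelated to the survey.

```r
# Hypothetical data: four pets, one with an unknown weight
pet        <- c("cat", "dog", "cat", "cat")
pet_weight <- c(4.2, NA, 3.8, 5.0)

table(pet)                      # frequency counts: cat 3, dog 1
mean(pet_weight)                # NA, because one element is missing
mean(pet_weight, na.rm = TRUE)  # mean of the three non-missing values
```

Many summary functions in base R (for example median(), sd(), sum(), max()) accept the same na.rm argument.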

D - Read & write delimiter-separated text files

In this section, you will read in a subset of the well-known diamonds data set and prepare it for data analysis.

  1. Identify the file path to the diamonds.csv dataset using the "" (quotation marks) auto-complete trick. Place the cursor between two quotation marks, hit ⇥ (tab-key), and browse through the folders. Save the file path, for now, in an object called diamonds_path.
# place cursor in-between "" and hit tab
diamonds_path <- ""
  2. Now use diamonds_path inside the read_csv() function to read in the diamonds.csv dataset. Store it as a new object called diamonds.
# read diamonds data
diamonds <- read_csv(file = XX)
  3. Print the diamonds data and inspect the column names in the header line. Something’s wrong!

  4. Fix the header by reading in the data again using the col_names argument. Assign to col_names a character vector containing the correct column names: carat, cut, color, clarity, depth, table, price.

# read diamonds data with specified col_names
diamonds <- read_csv(file = "XX", 
                     col_names = c('name_1','name_2','...'))  # Vector of column names
  5. Re-inspect the header by printing the data. Has the header been fixed?

  6. Now pay attention to the classes of the individual columns (variables). Have all classes been identified correctly? What about the carat column? It should be numeric, right?

  7. Let’s see what went wrong. Select and print the carat variable to identify the one entry that caused the variable to become a character vector (Hint: look for a comma between entries 10 and 20).

  8. Change the incorrectly formatted entry in carat by replacing XX with the index of the incorrect value (i.e., the correct number between 10 and 20) and YY with the correct entry with a period (.) instead of a comma (,) in the code below.

# Change the value at position XX to YY
diamonds$carat[XX] <- YY
  9. OK, you fixed the value, but carat is still a character vector. We can fix this with the type_convert() function. Apply the type_convert() function to the diamonds data to have R re-infer all the data types. Make sure to assign the result back to diamonds so that you change the object!
# re-infer data types
diamonds <- type_convert(diamonds)
  10. Print the diamonds object and look at the column types. Has the type of carat changed to double?

  11. Write the now properly formatted diamonds data to your data folder as a .csv file using the name diamonds_clean.csv. Don’t forget to include both the folder and the file name (separated by /) in the character string specifying the file argument.

# write clean diamonds data to disc
write_csv(x = XX, file = "XX")
  12. Read diamonds_clean.csv back into R as a new object called diamonds_clean. Then, print the object and verify that this time the types have been correctly identified from the start.

  13. The data is now ready for analysis. Explore it a bit by calculating a few statistics. For instance, what is the average carat or price (use mean())? What cut and clarity levels exist and how often do they occur (use table() on both variables)? You can learn more about the variable values from the help file ?diamonds.
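
The whole read, fix, convert, write cycle can be sketched on a tiny invented file. The file contents, the object names (gems, raw_path, clean_path), and the column names are hypothetical. The sketch also assumes a readr version in which write_csv() takes a file argument (older versions called it path).

```r
library(readr)

# Hypothetical headerless file: two columns, one entry uses a decimal comma
raw_path <- tempfile(fileext = ".csv")
writeLines(c('0.5,red', '"0,7",blue', '1.2,green'), raw_path)

# Supply the column names, since the file has no header line
gems <- read_csv(file = raw_path, col_names = c("size", "colour"))
gems$size             # print the values to spot the malformed entry

# Replace the malformed entry, then let readr re-infer the column types
gems$size[2] <- "0.7"
gems <- type_convert(gems)
class(gems$size)      # "numeric" once every entry parses as a number

# Write the cleaned data, then read it back in
clean_path <- tempfile(fileext = ".csv")
write_csv(x = gems, file = clean_path)
gems_clean <- read_csv(file = clean_path)
```

Because the cleaned file has a header line and only well-formed numbers, the second read_csv() call needs no extra arguments.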

E - Logical Vectors and $

  1. Logical vectors contain only the values TRUE and FALSE (and NA). Create a new logical vector called expensive indicating which diamonds cost more than $10,000. To do this, select the price variable from the data frame using $ and apply the > (greater than) operator in the form vector > value.
# Create a logical vector expensive indicating
# which diamonds cost more than 10,000

ZZ <- diamonds$XX > YY
  2. Print your expensive vector to the console. Do you see only TRUE and FALSE values? If so, do the first few values match those in the price variable?

  3. Add your expensive vector to the diamonds data frame using data_frame$variable_name <- variable. See below.

# add vector to data frame
XX$YY <- ZZ
  4. Using the table() function, create a table showing how many of the diamonds are expensive and how many are not. Select the variable directly from the data frame using $.

  5. Using the mean() function, determine the percentage of the diamonds that are expensive, i.e., mean(expensive). Should this have worked?

  6. What percentage of the diamonds weigh more than 1 carat (i.e., more than 0.2 grams)?
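
As a hedged illustration of the pattern used in this section, the sketch below applies it to an invented data frame. The object flats and its columns are hypothetical and have nothing to do with the diamonds data.

```r
# Hypothetical data frame of four flats (values invented for illustration)
flats <- data.frame(rent  = c(900, 2400, 1500, 3100),
                    rooms = c(1, 4, 2, 5))

# Comparing a column with a value yields one TRUE/FALSE per row
pricey <- flats$rent > 2000
pricey                 # FALSE TRUE FALSE TRUE

# Store the logical vector as a new column of the data frame
flats$pricey <- pricey

table(flats$pricey)    # 2 FALSE, 2 TRUE
mean(flats$pricey)     # 0.5, because TRUE counts as 1 and FALSE as 0
```

The mean of a logical vector is therefore the proportion of TRUE values. Multiply by 100 to express it as a percentage.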

F - Read other file formats

Excel

  1. Using read_excel(), read in the titanic.xls dataset as a new object called titanic. (Make sure you have already loaded the readxl package at the beginning of your script.)
titanic <- read_excel(path = "XX")
  2. Print titanic and evaluate its dimensions using dim().

  3. Using table(), how many people survived (variable survived) in each cabin class (variable pclass)?

# determine survival rate by cabin class
table(titanic$XX, 
      titanic$XX)
  4. Using write_csv(), write the titanic data frame as a new comma-separated text file called titanic.csv in your 1_Data folder. Now you have the data saved as a text file any software can use!
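
The same read, inspect, tabulate, write sequence can be tried on an example workbook that ships with readxl, so no course files are needed. The object names flowers, xlsx_path, and csv_path are hypothetical. The sketch assumes the bundled file datasets.xlsx, whose first sheet holds the iris data.

```r
library(readxl)
library(readr)

# readxl_example() returns the path to a workbook shipped with readxl
xlsx_path <- readxl_example("datasets.xlsx")

# By default read_excel() reads the first sheet (here: the iris data)
flowers <- read_excel(path = xlsx_path)
dim(flowers)               # number of rows, then number of columns
table(flowers$Species)     # frequency of each species

# Save the data as comma-separated text in a temporary file
csv_path <- tempfile(fileext = ".csv")
write_csv(x = flowers, file = csv_path)
```

To read a different sheet, read_excel() accepts a sheet argument that takes either a sheet name or its position.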

SPSS

  1. Using read_spss(), read in the sleep data set sleep.sav of staff at the University of Melbourne as a new object called sleep. (Make sure that you have first loaded the haven package.)
XX <- read_spss(file = "XX")
  2. Print your sleep object and evaluate its dimensions using dim().

  3. How many drinks do staff at the University of Melbourne consume per day (variable alcohol)? To find out, use the mean() function, taking care of missing values with the na.rm argument.

  4. Using the write_csv() function, write the sleep data to a new file called sleep.csv in your 1_Data folder. Now you have the sleep data stored as a text file any software can use!
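
A minimal round trip shows how read_spss() and na.rm work together. Since the sketch should not depend on the course files, it first writes a small invented data set to a temporary .sav file with haven's write_sav(). The object names and values are hypothetical.

```r
library(haven)

# Create a small hypothetical .sav file so there is something to read
sav_path <- tempfile(fileext = ".sav")
write_sav(data.frame(id = 1:4, coffees = c(2, NA, 0, 4)), sav_path)

# Read the SPSS file back in
naps <- read_spss(file = sav_path)
dim(naps)                          # 4 rows, 2 columns
mean(naps$coffees)                 # NA, because of the missing value
mean(naps$coffees, na.rm = TRUE)   # 2, the mean of 2, 0, and 4
```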

SAS

  1. Using read_sas(), read in airbnb_zuerich.sas7bdat containing AirBnB listings in Zürich, Switzerland and call the object airbnb_zuerich.
# read sas data
XX <- read_sas(data_file = "XX")
  2. Print airbnb_zuerich and then evaluate its dimensions using dim().

  3. How many AirBnB listings were there of each room_type in Zürich? (Hint: use table())

  4. Using write_csv(), write your airbnb_zuerich data frame to a new comma-separated text file called airbnb_zuerich.csv in your 1_Data folder.
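
For practice without the course data, the following sketch reads a SAS file that, as far as we know, is bundled with haven in its examples folder (iris.sas7bdat). Treat the file location as an assumption: if system.file() returns an empty string, the file is not available in your haven version. The object names are hypothetical.

```r
library(haven)
library(readr)

# Assumed location of an example SAS file shipped with haven
sas_path <- system.file("examples", "iris.sas7bdat", package = "haven")

flowers_sas <- read_sas(data_file = sas_path)
dim(flowers_sas)               # number of rows, then number of columns
table(flowers_sas$Species)     # frequency of each species

# Save the data as comma-separated text in a temporary file
write_csv(x = flowers_sas, file = tempfile(fileext = ".csv"))
```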

G - Creating data frames

  1. Using the data.frame() function, create a data frame called ten_df that contains each of the vectors you created in Section B: id, age, sex, height, weight.
# Create data frame ten_df containing vectors id, age, sex, height, and weight.
XX <- data.frame(XX, XX, XX, XX, XX)
  2. Print your ten_df object to see how it looks! Does it contain all of the vectors?

  3. Using the dim() function, print the number of rows and columns in your data frame. Do you get 10 rows and 5 columns?

  4. What is the class of your ten_df object? Use the class() function to find out!

  5. Use the summary() function to print descriptive statistics for each column of ten_df.

  6. Using the $ operator, print the age column from the ten_df data frame.

  7. Calculate the maximum age value from the ten_df data frame using max(). Do you get the same result as when you calculated it from the original vector age?

  8. Instead of creating a data frame using the data.frame() function, try creating a tibble called ten_tibble using the tibble() function. Tibbles are a more modern, leaner variant of data frames that we prefer over the classic data.frame. You can use the exact same arguments you used before.

  9. Print your new ten_tibble object. How does it look different from ten_df? Try calculating the maximum age from this object. Is it different from what you got before?
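
The difference between the two constructors can be sketched with two short invented vectors. The names player, score, score_df, and score_tibble are hypothetical.

```r
library(tibble)

# Hypothetical vectors of equal length
player <- c("Ana", "Ben", "Cy")
score  <- c(12, 30, 21)

# Same arguments, two different container types
score_df     <- data.frame(player, score)
score_tibble <- tibble(player, score)

dim(score_df)             # 3 2
class(score_df)           # "data.frame"
class(score_tibble)       # "tbl_df" "tbl" "data.frame"
summary(score_df)         # descriptive statistics per column

# Columns are accessed with $ in exactly the same way
max(score_df$score)       # 30
max(score_tibble$score)   # 30
```

A tibble is still a data frame, as its class shows. The visible difference is mainly in how it prints: dimensions and column types appear in the header.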

X - Challenges

  1. If you take the sum() of a logical vector, R will return the number of cases that are TRUE. Using this, find out how many of the ten Baselers are male while using the is-equal-to operator ==.
# Determine the frequency of a case in a vector
sum(XX == XX)
  2. You can use logical vectors to select rows from a data frame based on certain criteria. Using the following template, get the id values of Baselers who are younger than 30:
# Create a logical vector indicating which Baselers are younger than 30
young_30 <- XX$XX < 30

# Print the ids of Baselers younger than 30
XX$XX[young_30]
  3. Use a combination of logical vectors and the mean() function to answer the question: “What is the mean age of Baselers who are heavier than 80kg?”

  4. What are the id values of Baselers who are male and are shorter than 165cm? (Hint: You will need to use the logical AND operator & to combine multiple logical vectors)
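
The techniques behind these challenges (counting with sum(), indexing with a logical vector, and combining conditions with &) can be sketched on an invented data frame. The object pets and all of its values are hypothetical.

```r
# Hypothetical data frame of five pets
pets <- data.frame(id      = 1:5,
                   species = c("cat", "dog", "cat", "bird", "dog"),
                   age     = c(3, 11, 7, 2, 5),
                   weight  = c(4.1, 30.5, 5.2, 0.1, 12.0))

# sum() of a logical vector counts the TRUE cases
sum(pets$species == "dog")                      # 2

# A logical vector inside [] keeps only the TRUE positions
young <- pets$age < 6
pets$id[young]                                  # 1 4 5

# Summarise one column for the cases selected by another
mean(pets$age[pets$weight > 5])                 # mean of 11, 7, and 5

# & is TRUE only where both conditions are TRUE
pets$id[pets$species == "dog" & pets$age < 6]   # 5
```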

Datasets

File                     Rows  Columns  Description
diamonds.csv             100   7        Subset of the well-known diamonds data set containing specifications and prices of a large number of recorded diamonds.
titanic.xls              1309  14       Information on the survival of titanic passengers.
sleep.sav                271   55       Survey on sleeping behavior completed by staff at the University of Melbourne.
airbnb_zuerich.sas7bdat  2392  20       Data on AirBnB listings in Zürich, Switzerland.

Packages

Package    Installation
tidyverse  install.packages("tidyverse")
readr      install.packages("readr")
haven      install.packages("haven")
readxl     install.packages("readxl")

Glossary

Creating vectors

Function              Description
c("a", "b", "c")      Create a character vector
c(1, 2, 3)            Create a numeric vector
c(TRUE, FALSE, TRUE)  Create a logical vector

Vector functions

Function                           Description
mean(x), median(x), sd(x), sum(x)  Mean, median, standard deviation, sum
max(x), min(x)                     Maximum, minimum
table(x)                           Table of frequency counts
round(x, digits)                   Round a numeric vector x to digits decimal places
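
A quick way to get a feel for these functions is to run them on a short vector. The vector x below is an invented example.

```r
# Hypothetical numeric vector
x <- c(1.25, 2.5, 4)

sum(x)                       # 7.75
median(x)                    # 2.5
max(x)                       # 4
min(x)                       # 1.25
sd(x)                        # sample standard deviation
round(mean(x), digits = 1)   # 2.6 (the unrounded mean is 2.58333...)
```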

Accessing vectors from data frames

Function  Description
df$name   Access vector name from a data frame df

Reading/writing text data

Extension  File Type                 Read              Write
.csv       Comma-separated text      read_csv(file)    write_csv(x, file)
.csv       Semicolon-separated text  read_csv2(file)   write_csv2(x, file)
.txt       Other text                read_delim(file)  write_delim(x, file)
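
The difference between the comma and semicolon variants is easiest to see on a small file. The sketch below writes an invented semicolon-separated file with decimal commas, then reads it in two equivalent ways. The file contents and object names are hypothetical.

```r
library(readr)

# Hypothetical semicolon-separated file that uses ',' as the decimal mark
semi_path <- tempfile(fileext = ".csv")
writeLines(c("name;score", "anna;1,5", "ben;2,25"), semi_path)

# read_csv2() expects ';' between fields and ',' as the decimal mark
semi <- read_csv2(file = semi_path)
semi$score    # 1.50 2.25

# read_delim() does the same once delimiter and decimal mark are spelled out
same <- read_delim(file = semi_path, delim = ";",
                   locale = locale(decimal_mark = ","))
same$score    # 1.50 2.25
```

Reading such a file with read_csv() instead would not split the columns correctly, because it expects commas between fields.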

Reading/writing other data formats

Extension    File Type  Read                Write
.xls, .xlsx  Excel      read_excel(path)    xlsx::write.xlsx()
.sav         SPSS       read_spss(file)     write_sav(data, path)
.sas7bdat    SAS        read_sas(data_file) write_sas(data, path)

Creating data frames from vectors

Function             Description
data.frame(a, b, c)  Create a data frame from vectors a, b, c
tibble(a, b, c)      Create a tibble from vectors a, b, c

Examples

library(tidyverse)
library(readr)
library(readxl)
library(haven)

# Create vectors of (fake) stock data
name      <- c("apple", "microsoft", "dell", "google", "twitter")
yesterday <- c(100, 89, 65, 54, 89)
today     <- c(102, 85, 72, 60, 95)

# Summary statistics
mean(today)
mean(yesterday)

# Show classes
class(name)
class(yesterday)

# Operations of vectors
change <- today - yesterday
change # Print result

# Create a logical vector from two numerics
increase <- today > yesterday
increase # Print result

# Create a tibble combining multiple vectors
stocks <- tibble(name, yesterday, today, change, increase)

# Get column names
names(stocks)

# Access columns by name
stocks$name
stocks$today

# Calculate descriptives on columns
mean(stocks$yesterday)
median(stocks$today)
table(stocks$increase)
max(stocks$increase)


# read/write delim-separated -------------------

# read chickens data
chickens <- read_csv(file = "1_Data/chickens.csv")

# fix header of chickens_nohead.csv with known column names
chickens <- read_csv(file = "1_Data/chickens_nohead.csv",
                     col_names = c("weight", "time", "chick", "diet"))

# fix NA values of chickens_na.csv
chickens <- read_csv(file = "1_Data/chickens_na.csv",
                     na = c('NA', 'NULL'))

# write clean data to disc
write_csv(x = chickens,
          file = "1_Data/chickens_clean.csv")

# fix types -------------------
# Note: the survey data is fictional!
survey <- tibble(rating = c("1.5", "2,1", "3.0"))  # hypothetical example data

# replace the comma-formatted entry in rating
survey$rating[survey$rating == "2,1"] <- "2.1"

# rerun type convert
survey <- type_convert(survey)

# other formats -------------------

# .xlsx (Excel)
chickens <- read_excel("1_Data/chickens.xlsx")

# .sav (SPSS)
chickens <- read_spss("1_Data/chickens.sav")

# .sas7bdat (SAS)
chickens <- read_sas("1_Data/chickens.sas7bdat")

Resources

  • For more information on the fundamentals of objects and functions in R, see the R Core Team’s Introduction to R. For more advanced object- and function-related topics, see Hadley Wickham’s Advanced R.
  • For more information on reading and writing data (and everything else), see Grolemund and Wickham’s R for Data Science.